ABSTRACT


Classification is a form of data analysis that can be used to extract models
describing important data classes or to predict future data trends. It is the
process of finding a set of models that describe and distinguish data classes
or concepts, so that the model can be used to predict the class of objects
whose class label is unknown. Among classification techniques, the Naive
Bayesian Classifier is one of the simplest probabilistic classifiers. This
paper studies the Naive Bayesian Classifier and uses it to classify the class
label of paddy type data. The system predicts four class labels and displays
the selected impact attributes of each class label. Moreover, the types of
paddy in the paddy dataset can also be predicted by using other classification
methods such as Decision Tree and Artificial Neural Network. Furthermore, this
system can be used to predict production rate and display the selected impact
attributes of other crops such as soybeans, corn and cotton. This paper
focuses on the paddy dataset and decides whether the paddy type is Lasbar,
Yar Sabar, Yenat Khan Sabar or Sar Ngan Khan Sabar.



KEYWORDS: Naive Bayesian, Paddy types, Classification, Large dataset 

1. INTRODUCTION 

Computers are widely used in education, health, arts, humanities, social
science, industry, communication, government, administration, research,
business and agriculture. Computer systems that store large amounts of
data are often required not only to retrieve but also to classify that
data rapidly in a variety of sequences and combinations. Data mining is
one of the computer-aided approaches to tasks such as classification and
prediction. It refers to extracting or mining knowledge from large
amounts of data. Classification is an important technique in data
mining. A classification model can also be used to predict the class
label of unknown records. In this paper the paddy types are identified
as Lasbar, Yar Sabar, Yenat Khan Sabar or Sar Ngan Khan Sabar. The paddy
data has a number of instances and a number of attributes, and it is a
large dataset. Along with decision trees and neural networks, the
Bayesian Classifier is one of the most practical and most used learning
methods. It is appropriate to use when:

> A moderate or large training set is available,

> The attributes that describe instances are conditionally
independent given the classification.

So, this paper implements the Bayesian Classifier on paddy type
data.

2. RELATED WORK 

Harry Zhang proposed sufficient and necessary conditions for the
optimality of naive Bayes and investigated its optimality under the
Gaussian distribution. W. Zhang and F. Gao proposed an auxiliary
feature method. It determines features by an existing feature
selection method, and selects an auxiliary feature which can
reclassify the text space aimed at the chosen features. Then
the corresponding conditional probability is adjusted in order
to improve classification accuracy. They show that the
proposed method indeed improves the performance of the naive
Bayes classifier. J. Ren proposed a novel naive Bayes
classification algorithm for uncertain data. The key solution is
to extend the class conditional probability estimation in the
Bayes model. Extensive experiments on UCI datasets show
that the accuracy of the naive Bayes model can be improved by
taking into account the uncertainty information. Toon Calders
and Sicco Verwer investigated how to modify the naive Bayes
classifier in order to perform classification that is restricted
to be independent with respect to a given sensitive attribute.

3. THEORETICAL BACKGROUND 

Data mining is the process of digging or gathering
information from various databases. Data mining should
have been more appropriately named knowledge mining
from data. Data mining involves the use of sophisticated data
analysis tools to discover previously unknown, valid patterns
and relationships in large datasets. These tools can include
statistical models, mathematical algorithms, and machine
learning methods.

Data mining can be performed on data represented in
quantitative, textual, or multimedia forms. Data mining
applications can use a variety of parameters to examine the
data. They include association, sequence or path analysis,
classification, clustering and forecasting.

3.1 KINDS OF DATA MINING

There are a number of different data stores on which mining can be
performed. In principle, data mining should be applicable to
any kind of information repository. This includes relational 
databases, data warehouses, transactional databases, 
advanced database systems, flat files, and the World Wide 
Web. Advanced database systems include object-oriented 
and object-relational databases, and specific application- 
oriented databases, such as spatial databases, time-series 
databases, text databases and multimedia databases. The 
challenges and techniques of mining may differ for each of 
the repository systems. 

3.2 TYPES OF DATA MINING 

In general, data mining tasks can be classified into two 
categories: descriptive and predictive. 

1. Descriptive tasks: The objective is to derive patterns 
that summarize the underlying relationships in data. 
Descriptive data mining tasks are often exploratory in 
nature and frequently require post-processing 
techniques to validate and explain the results. 

2. Predictive tasks: The objective of these tasks is to
predict the value of a particular attribute based on the
values of other attributes. The attribute to be predicted is
commonly known as the target or dependent variable,
while the attributes used for making the prediction are
known as the explanatory or independent variables.

Descriptive mining tasks characterize the general properties 
of the data in the database. Predictive mining tasks perform 
inference on the current data in order to make predictions. 
The kinds of data mining are Association Analysis, 
Classification and prediction, Cluster Analysis, Outlier 
Analysis, and Evolution Analysis. Data mining systems can be 
categorized according to various criteria, as follows. 

> Classification according to the Kinds of Database Mined 

> Classification according to the Kinds of Knowledge 
Mined 

> Classification according to the Kinds of Techniques 
Utilized 

> Classification according to the Application Adapted 

4. CLASSIFICATION 

Classification is a data mining (machine learning) technique
used to predict group membership for data instances.

Data classification is a two-step process. In the first step, a 
model is built describing a predetermined set of data classes. 
The model is constructed by analyzing data tuples described 
by attributes. Each tuple is assumed to belong to a 
predefined class, as determined by one of the attributes, 
called the class label attribute. In the context of classification, 
data tuples are also referred to as samples, examples or 
objects. 

In the second step, the model is used for classification. First,
the predictive accuracy of the model or classifier is estimated.
If the accuracy of the classifier is considered acceptable, the 
model can be used to classify future data tuples for which the 
class label is not known. Such data are also referred to as 
"unknown" or "previously unseen" data. 


Any classification method uses a set of features or 
parameters to characterize each object, where these features 
should be relevant to the task at hand. We consider here 
methods for supervised classification, meaning that a human 
expert both has determined into what classes an object may 
be categorized and also has provided a set of sample objects 
with known classes. This set of known objects is called the 
training set because it is used by the classification programs 
to learn how to classify objects. 

There are two phases to constructing a classifier. In the 
training phase, the training set is used to decide how the 
parameters ought to be weighted and combined in order to 
separate the various classes of objects. In the application
phase, the weights determined from the training set are
applied to a set of objects that do not have known classes in
order to determine what their classes are likely to be.

If a problem has only a few (two or three) important
parameters, then classification is usually an easy problem.
For example, with two parameters one can often simply 
make a scatter-plot of the feature values and can determine 
graphically how to divide the plane into homogeneous 
regions where the objects are of the same classes. The 
classification problem becomes very hard, though, when 
there are many parameters to consider. 

Not only is the resulting high-dimensional space difficult to 
visualize, but there are so many different combinations of 
parameters that techniques based on exhaustive searches of 
the parameter space rapidly become computationally 
infeasible. Practical methods for classification always involve 
a heuristic approach intended to find a "good-enough" 
solution to the optimization problem. 

A classification model can also be used to predict the class
label of unknown records. A classification model can be
treated as a black box that automatically assigns a class label
when presented with the attribute set of an unknown record.
The classifier design can be performed with labeled or
unlabeled data. Using a supervised learning method, the
computer is given a set of objects with known classification
and is asked to classify an unknown object based on the
information acquired by it during the training phase.

4.1 METHODS INCLUDED IN CLASSIFICATION

Classification methods are needed for processing the huge
quantities of data generated by modern astronomical
instruments. Some of the methods included in classification are:

> Decision trees 

> Bayesian classification 

> Classification by back-propagation 

> Classification based on concepts from association rule 
mining 

4.1.1 BAYESIAN CLASSIFICATION 

Bayesian Classification is based on Bayes' Theorem. Bayesian 
Classifiers are useful in predicting the probability that a 
sample belongs to a particular class or grouping. This
technique tends to be highly accurate and fast, making it
useful on large databases. Depending on the precise nature 
of the probability model, Naive Bayes classifiers can be 
trained very efficiently in a supervised learning setting. In 
many practical applications, parameter estimation for Naive 
Bayes models uses the method of maximum likelihood; in 
other words, one can work with the Naive Bayes model 
without believing in Bayesian probability or using any 
Bayesian methods. 

In simple terms, a naive Bayes classifier assumes that, given
the class, the presence (or absence) of a particular feature is
unrelated to the presence (or absence) of any other feature.
For example, a fruit may be considered to be an apple if it is
red, round, and about 4" in diameter. Even if these features
depend on each other or upon the existence of the other
features, a naive Bayes classifier considers all of these
properties to independently contribute to the probability
that this fruit is an apple. Naive Bayes classifiers often work
much better in many complex real-world situations than one
might expect.

An advantage of the Naive Bayes classifier is that it requires 
a small amount of training data to estimate the parameters 
necessary for classification. Because independent variables 
are assumed, only the variances of the variables for each 
class need to be determined and not the entire covariance 
matrix. 
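As a rough illustration of the point above, the following Python sketch estimates only a mean and a variance per attribute per class. The two-attribute toy samples and the names `samples` and `fit_gaussian_params` are assumptions made for this example; they are not taken from the paddy dataset.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Hypothetical toy data: (class label, [numeric attribute values]).
samples = [
    ("A", [1.0, 10.0]),
    ("A", [3.0, 14.0]),
    ("B", [6.0, 20.0]),
    ("B", [8.0, 24.0]),
]

def fit_gaussian_params(data):
    """Return {label: [(mean, variance), ...]} with one pair per attribute."""
    by_class = defaultdict(list)
    for label, values in data:
        by_class[label].append(values)
    params = {}
    for label, rows in by_class.items():
        columns = zip(*rows)  # regroup the rows attribute by attribute
        params[label] = [(mean(col), pvariance(col)) for col in columns]
    return params

params = fit_gaussian_params(samples)
```

With n attributes this stores n means and n variances per class, whereas a full covariance matrix would need n(n+1)/2 distinct entries per class in addition to the means.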

4.1.2 NAIVE BAYESIAN CLASSIFIERS CHARACTERISTICS

Naive Bayesian Classifiers generally have the following
characteristics.

> They are robust to isolated noise points because such points
are averaged out when estimating conditional probabilities
from data. Naive Bayes Classifiers can also handle missing
values by ignoring the example during model building and
classification.

> They are robust to irrelevant attributes. If Xi is an
irrelevant attribute, then P(Xi|Y) becomes almost uniformly
distributed, so the class conditional probability for Xi has
no impact on the overall computation of the posterior
probability.

> Correlated attributes can degrade the performance of Naive
Bayes classifiers because the conditional independence
assumption no longer holds for such attributes.
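The robustness to irrelevant attributes can be checked with a small numeric sketch. The priors and likelihood values below are invented for illustration, and `posterior` is a hypothetical helper; the point is only that a factor which is the same for every class cancels when the scores are normalized.

```python
def posterior(priors, likelihoods):
    """Normalize prior * product of per-attribute likelihoods for each class."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for p in likelihoods[label]:
            score *= p
        scores[label] = score
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

priors = {"A": 0.5, "B": 0.5}

# One informative attribute only.
relevant_only = {"A": [0.8], "B": [0.2]}

# Same attribute plus an irrelevant one whose likelihood is equal in both classes.
with_irrelevant = {"A": [0.8, 0.25], "B": [0.2, 0.25]}

p1 = posterior(priors, relevant_only)
p2 = posterior(priors, with_irrelevant)
```

Both calls give the same posterior for each class, so the irrelevant attribute does not change which class is chosen.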

4.1.3 NAIVE BAYESIAN CLASSIFICATION EQUATION 

By Bayes' theorem, the posterior probability of class Ci given the
sample X is P(Ci|X) = P(X|Ci) P(Ci) / P(X). P(X) is constant for all
classes, so finding the most likely class amounts to maximizing
P(X|Ci) P(Ci). P(Ci) is the prior probability of class i. If the prior
probabilities are not known, equal probabilities can be assumed.
Assuming the n attributes are conditionally independent:

$$P(X \mid C_i) = \prod_{k=1}^{n} P(X_k \mid C_i)$$

P(Xk|Ci) is the probability density function for attribute k and is
estimated from the training samples. For a categorical attribute,
estimate P(Xk|Ci) as the percentage of samples of class i with value
Xk; training then involves counting the percentage of occurrence of
each possible value for each class. For a numeric attribute, use
statistics of the sample data to estimate P(Xk|Ci). The actual form of
the density function is generally not known, so a Gaussian density is
often assumed, and training involves computing the mean and variance of
each attribute for each class. Gaussian distribution for numeric
attributes:

$$P(X_k \mid C_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{C_i}} \exp\left(-\frac{(x_k - \mu_{C_i})^2}{2\sigma_{C_i}^2}\right)$$

Where,

μ_Ci is the mean of attribute k observed in samples of class Ci.
σ_Ci is the standard deviation of attribute k observed in samples of
class Ci.
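A minimal Python sketch of the Gaussian density for one numeric attribute follows, assuming the per-class mean and standard deviation have already been estimated and that the standard deviation is positive. The function name `gaussian_likelihood` and the numbers used are hypothetical.

```python
import math

def gaussian_likelihood(x, mu, sigma):
    """Gaussian density of value x for a class with mean mu and std. dev. sigma."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

# Hypothetical numbers: density at the class mean and one std. dev. away from it.
at_mean = gaussian_likelihood(25.0, 25.0, 2.0)
one_sigma = gaussian_likelihood(27.0, 25.0, 2.0)
```

The value returned is a density, not a probability, so it is only meaningful when compared across classes for the same attribute value.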

5. IMPLEMENTATION 

In this training data set, there are 13 attributes. All of the
attributes are nominal attributes. There are 4 classes:
Lasabar, Yasabar, Sar Ngan Khan Sabar and Yenat Khan
Sabar. The following table lists the attribute names and
their descriptions.


Table 1. Attribute Information

No.  Attribute Name           Description
1.   Plant genus              Etmahta, Lethyaysin, Ngasein, Midone
2.   Plant lifetime           120-125, 125-130, 140-145, 130-135, 135-140, 115-120
3.   Plant height             4.0’-4.5’, 3.0’-3.5’, 3.5’-4.0’, 4.5’-5.0’, 2.5’-3.0’, 5.0’-5.5’, 5.5’-6.0’
4.   Plant spike              10-20, 6-8, 9-10, 7-9, 9-11, 8-10, 5-7, 2-3, 4-6, 10-15
5.   Spike length             9.5”, 11”, 12”, 10.5”, 11”, 9”, 8.5”, 10.8”, 12”, 10”
6.   Seed number in a spike   240, 117, 160, 95, 155, 130, 140, 150, 170, 234, 250
7.   Seed weight (1000) (g)   25.5, 21, 26, 29, 28, 25, 22, 24.5, 25.5, 19, 16, 23
8.   Productive rice (%)      45, 40, 50, 55, 45, 60, 54, 53, 41, 64, 51, 37, 90, 49, 60
9.   Amilo (%)                23, 25, 19, 18, 20, 21, 30, 26, 24
10.  Rice quality             Kyilin, Baikphyupar, Nauk
11.  Consumption              Fair, Soft, Good, Hard
12.  Light response           Yes or No
13.  Production rate          80-100, 40-70, 60-70, 60-80, 100, 100-150, 40-60, 30-50, 100-120

5.1 ALGORITHM OF PADDY TYPES BASED ON NAIVE
BAYESIAN CLASSIFICATION

Algorithm: Naive Bayesian Classification. Predict class
membership probabilities, such as the probability that a
given sample belongs to a particular class.

Input: Database, C, of the selected training samples dataset
and testing sample dataset, T, of unknown data X.

Output: Predicted class membership: Lasabar, Yasabar, Sar
Ngan Khan Sabar or Yenat Khan Sabar.

Method:

k = total record count of training sample dataset C;
for (i = 0; i < k; i++)
{
    if (C.record(i).cell(Result).value == Lasabar)
        LasabarCount++;
    else if (C.record(i).cell(Result).value == Yasabar)
        YasabarCount++;
    else if (C.record(i).cell(Result).value == Sar Ngan Khan Sabar)
        SarNganKhanSabarCount++;
    else if (C.record(i).cell(Result).value == Yenat Khan Sabar)
        YenatKhanSabarCount++;
}

totalLasabarProb = LasabarCount / k;
totalYasabarProb = YasabarCount / k;
totalSarNganKhanSabarProb = SarNganKhanSabarCount / k;
totalYenatKhanSabarProb = YenatKhanSabarCount / k;

m = total record count of testing sample dataset T;
n = total cells count in each record of testing sample dataset T
    except ID and Result fields;

for (i = 0; i < m; i++)
{
    LasabarProb = 1;
    YasabarProb = 1;
    SarNganKhanSabarProb = 1;
    YenatKhanSabarProb = 1;

    for (j = 0; j < n; j++)
    {
        v = T.record(i).cell(j).value;
        LasabarProb *= getCount(j, v, Lasabar) / LasabarCount;
        YasabarProb *= getCount(j, v, Yasabar) / YasabarCount;
        SarNganKhanSabarProb *= getCount(j, v, Sar Ngan Khan Sabar)
                                / SarNganKhanSabarCount;
        YenatKhanSabarProb *= getCount(j, v, Yenat Khan Sabar)
                              / YenatKhanSabarCount;
    }

    // multiply each class likelihood by the class prior
    LasabarProb *= totalLasabarProb;
    YasabarProb *= totalYasabarProb;
    SarNganKhanSabarProb *= totalSarNganKhanSabarProb;
    YenatKhanSabarProb *= totalYenatKhanSabarProb;

    display as result the class whose Prob value is the largest;
}

Procedure getCount(colIndex, colVal, result)
k = total record count of training sample dataset C;
count = 0;
for (i = 0; i < k; i++)
{
    if (C.record(i).cell(colIndex).value == colVal &&
        C.record(i).cell(Result).value == result)
        count++;
}
return count;
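The counting procedure above can be sketched in runnable Python as follows. This is an illustration under stated assumptions, not the system's actual implementation: the function names `train` and `classify`, the two-attribute toy records and their class labels are all hypothetical, and the predicted class is taken as the one maximizing the prior times the product of the per-attribute frequencies.

```python
from collections import Counter

def train(records, labels):
    """Count classes and (attribute index, value, class) co-occurrences."""
    class_counts = Counter(labels)
    value_counts = Counter()
    for record, label in zip(records, labels):
        for j, value in enumerate(record):
            value_counts[(j, value, label)] += 1
    return class_counts, value_counts

def classify(record, class_counts, value_counts):
    """Return the class maximizing P(Ci) * prod_k P(Xk|Ci)."""
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total  # prior P(Ci)
        for j, value in enumerate(record):
            score *= value_counts[(j, value, label)] / count  # P(Xk|Ci)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy records with two attributes (light response, consumption).
train_records = [("Yes", "Soft"), ("Yes", "Good"), ("No", "Hard"), ("No", "Fair")]
train_labels = ["Lasabar", "Lasabar", "Yasabar", "Yasabar"]

class_counts, value_counts = train(train_records, train_labels)
predicted = classify(("Yes", "Soft"), class_counts, value_counts)
```

An attribute value never seen together with a class makes that class's score zero, exactly as in the counting procedure; smoothing is not part of the algorithm described here and is left out.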


5.2 SYSTEM FLOW DIAGRAM 



Figure 1. Overview of System


6. CONCLUSION AND FURTHER EXTENSION

This paper has presented classification on large datasets and
has considered the classification problem by using Naive
Bayesian Classification. The approach is intended to deal
efficiently and effectively with large datasets. The accuracy
of the classifier on the dataset can also be assessed using the
hold-out method. The relative performance of the Naive Bayesian
Classifier can serve as an estimate of the conditional
independence of attributes. This work can be extended to apply
the Naive Bayesian classifier to other data sets. Moreover, the
paddy type dataset can also be classified by using other
classifiers such as Decision Tree and Artificial Neural Network.
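The hold-out method mentioned above can be sketched as follows. The split fraction, the random seed, the toy data and the stand-in classifier are all assumptions for illustration; the paper does not state which split was used.

```python
import random

def holdout_split(records, labels, test_fraction=0.5, seed=0):
    """Shuffle indices and split them into a training part and a test part."""
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx, seq):
        return [seq[i] for i in idx]

    return (pick(train_idx, records), pick(train_idx, labels),
            pick(test_idx, records), pick(test_idx, labels))

def accuracy(predict, test_records, test_labels):
    """Fraction of test records whose predicted label matches the true label."""
    correct = sum(predict(r) == y for r, y in zip(test_records, test_labels))
    return correct / len(test_labels)

# Hypothetical data and a stand-in classifier that always answers "Lasabar".
records = [("Yes",), ("No",), ("Yes",), ("No",), ("Yes",), ("No",)]
labels = ["Lasabar", "Yasabar", "Lasabar", "Yasabar", "Lasabar", "Yasabar"]
train_x, train_y, test_x, test_y = holdout_split(records, labels)
acc = accuracy(lambda r: "Lasabar", test_x, test_y)
```

In practice the stand-in classifier would be replaced by a model trained only on the training part, so that the accuracy is measured on records the model has not seen.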